%HTFETs in General purpose architectures

%What are spaces in which HTFETs can complement CMOS -- what are new design points it opens up?
%Highly parallel systems  -- highly scalable parallelism -- more energy efficient at low voltage/frequency.
%New microarchitectural complexity  -- ILP vs TLP
%We then demonstrate the tradeoffs involved in using such CMOS and HTFET based processors for general purpose computing.

General-purpose processors must cater to a wide diversity of applications.
These applications may vary in their scalability with respect to the number of cores (thread-level parallelism), in their utilization of available processor or memory resources, or in their sensitivity to peak single-threaded performance.
In addition, the evaluation metrics and constraints may vary depending on the application domain.
For instance, in a power-constrained domain such as mobile processors, optimizing energy (and consequently battery life) or minimizing power for a given performance baseline may be more important than peak performance.
On the other hand, processors in the high-end server domain aim to maximize performance while keeping temperature under control in order to minimize cooling costs.
These hybrid power-performance-temperature metrics open up a huge design space in the microarchitecture and architecture domains.
As described in Section~\ref{sec:introduction}, CMOS and HTFET processors are capable of optimal operation in different regions of this design space.
Although the limited high-frequency operation of HTFET processors relative to CMOS prevents HTFET devices from being a direct replacement for CMOS technology, intelligent device-architecture co-design can help bridge this performance gap.

\subsection{Power- and Thermally-Constrained Application Scheduling}
Power-constrained performance optimization in the context of a CMOS-HTFET heterogeneous multicore has been explored in~\cite{codes12,ieeemicro-tfet}.
These works assume a uniform simple microarchitecture across all cores.
%I dont like this sentence
Such uniformity reduces the likelihood of the thermal hotspots that asymmetric microarchitectures can create, which makes power capping alone a feasible strategy.

On the other hand, when microarchitectures vary across cores, merely capping the overall power, and thereby confining the entire power consumption to a small fraction of the chip, does not adequately address the thermal concerns that result from increasing power density.
The Thermal Design Power (TDP) of a processor chip, defined as the power which the processor can dissipate without exceeding the maximum allowable chip temperature, is used as a metric to determine the power budget of processors.
%Dissipating all the power in a smaller area, even with large cool dark silicon around it, causes a significant increase in peak temperature due to higher power density.
Dissipating all the power in a smaller area causes a significant increase in peak temperature due to higher power density.
Hence, the wide range of application domains in which the processor may be deployed must also be taken into account.
These domains can be effectively characterized by the thermal limit that they entail.
For instance, an ARM-like embedded core for mobile devices operates under a much more stringent temperature limit than a Xeon-based server architecture.
In the former case, CMOS cores are forced to operate at sub-optimal frequencies with limited microarchitecture flexibility.
This provides opportunities for HTFETs, which, being more power-efficient and consequently more thermally efficient at these temperatures, can operate over a much wider range of microarchitecture complexities.
Thus, HTFETs can reach more favorable operating points in the frequency-issue-width design space.
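
The dependence of peak temperature on power density noted above can be illustrated with a lumped, steady-state approximation.
This is a simplifying assumption introduced here purely for intuition, with symbols chosen for illustration; it ignores lateral heat spreading and is not a substitute for a detailed thermal model:
\begin{equation}
T_{\mathrm{peak}} \approx T_{\mathrm{amb}} + R_{\mathrm{th}}\,P, \qquad R_{\mathrm{th}} \propto \frac{1}{A},
\end{equation}
where $T_{\mathrm{amb}}$ is the ambient temperature, $P$ is the dissipated power, $R_{\mathrm{th}}$ is an effective thermal resistance, and $A$ is the area over which the power is dissipated.
Under this approximation, dissipating the same power $P$ over half the area roughly doubles the temperature rise above ambient, which is why a chip-level power cap alone does not bound the peak temperature.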
%What are the other results I could include?

\begin{figure}[ht!]
\centering
\epsfig{file=figs/freq_complexity_all.eps, angle=0, width=1\linewidth, clip=}
\caption{\label{fig:freq-complexity-all} Permissible states in the frequency-issue-width design space for CMOS and HTFET processors at a) 330K, b) 340K and c) 350K temperature limits}
\end{figure}	

\begin{comment}
\begin{figure}[ht!]
\centering
\begin{minipage}[c]{1\linewidth}
\epsfig{file=figs/single_performance.eps, angle=0, width=1\linewidth, clip=}
\caption{\label{fig:single-performance} Speedup due to scheduling application on a heterogeneous multicore w.r.t a homogeneous CMOS configuration for thermal limits of 330K, 340K and 350K}
\end{minipage}
\end{figure}	
\vspace{0.1in}
\begin{figure}[ht!]
\centering
\begin{minipage}[c]{1\linewidth}
\epsfig{file=figs/single_edp.eps, angle=0, width=1\linewidth, clip=}
\caption{\label{fig:single-edp} EDP due to scheduling application on a heterogeneous multicore w.r.t a homogeneous CMOS configuration for thermal limits of 330K, 340K and 350K}
\end{minipage}
\end{figure}

Figure~\ref{fig:single-performance} shows the variation in speedup across different \emph{parsec} benchmarks for temperatures of 330K, 340K and 350K, while Figure~\ref{fig:single-edp} shows the corresponding EDP variation.
As observed in the figures, the benefit of using HTFET processors reduces with increase in temperature.
This is because, at higher temperatures, the disparity in the microarchitecture flexibility between HTFET and CMOS cores reduces, which makes CMOS cores the preferred choice due to their higher operating frequency.
\end{comment}

Figure~\ref{fig:freq-complexity-all} shows the configurations that CMOS and HTFET processors can attain under different thermal budgets.
The figure shows that HTFETs can operate at higher issue widths at lower temperature limits, while CMOS cores can reach higher operating frequencies as the thermal limit is increased.

The results from all our tradeoff comparisons at the architecture and microarchitecture level are summarized in Tables~\ref{tab:archresults},~\ref{tab:uarchresults-speedup} and~\ref{tab:uarchresults-edp}.
Table~\ref{tab:archresults} shows the performance improvement obtained by replacing a homogeneous multicore system with a heterogeneous CMOS-HTFET configuration, under a 1~W per core power budget. 
The advantages of task scheduling and power partitioning techniques over a naive technology substitution on the multicore are also highlighted.

Table~\ref{tab:uarchresults-speedup} shows the speedup obtained using different microarchitectures in CMOS and HTFET processors under different thermal constraints. 
It can be observed that the benefit of HTFET cores disappears at 350K: the multithreaded speedup falls below the baseline (0.84), and the single-threaded speedup shrinks to 1.01.
Similarly, Table~\ref{tab:uarchresults-edp} shows the Energy-Delay Product (EDP) of the best performing HTFET and CMOS core configurations at the above temperature limits.
The disparity in EDP between CMOS and HTFET cores narrows as the thermal limit increases, until CMOS becomes more energy efficient at thermal limits in excess of 350K.
These results clearly illustrate the diminishing returns that HTFETs offer as we transition to higher temperature domains.
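
To make the normalized values easier to read, assume the standard definition of EDP as energy multiplied by delay (the symbols below are introduced here for illustration only):
\begin{equation}
\mathrm{EDP} = E \cdot D = P_{\mathrm{avg}} \cdot D^{2},
\qquad
\mathrm{EDP}_{\mathrm{norm}} = \frac{\mathrm{EDP}_{\mathrm{het}}}{\mathrm{EDP}_{\mathrm{base}}}
= \frac{E_{\mathrm{het}}}{E_{\mathrm{base}}} \cdot \frac{1}{S},
\end{equation}
where $E$ is the energy consumed, $D$ is the execution time, $P_{\mathrm{avg}} = E/D$ is the average power, and $S = D_{\mathrm{base}}/D_{\mathrm{het}}$ is the speedup over the baseline.
Values of $\mathrm{EDP}_{\mathrm{norm}}$ below 1 favor the heterogeneous configuration.
For example, the single-threaded value of 0.41 at 330K in Table~\ref{tab:uarchresults-edp} corresponds to a 59\% reduction in EDP.
If the corresponding speedup of 1.46 in Table~\ref{tab:uarchresults-speedup} refers to the same configuration and workload (an assumption on our part), the implied energy ratio is $E_{\mathrm{het}}/E_{\mathrm{base}} \approx 0.41 \times 1.46 \approx 0.60$.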

\begin{table}[ht!]
\begin{minipage}[c]{1\linewidth}
\begin{center}
%\vspace{-0.2cm}
\begin{tabular}{|c|c|c|} \hline
Processor & Application Type  &Normalized	\\
Configuration & /Power Mapping & Speedup \\\hline
&& \\
Homogeneous multicore & Static & 1.0	\\
(CMOS or HTFET) & & \\\hline
&& \\
Heterogeneous multicore & Static & 1.05	\\
(CMOS-HTFET) & & \\\hline
&& \\
Heterogeneous multicore & Dynamic & 1.22	\\
(CMOS-HTFET) & & \\\hline
\end{tabular}
\vspace{0.1in}
\caption {Power Constrained Application Mapping on a CMOS-HTFET heterogeneous multicore, with a 1~W/core power budget}
\label{tab:archresults}
%\vspace{-0.2in}
\end{center}
\end{minipage}
\end{table}

\vspace{0.1in}

\begin{table}[ht!]
\begin{center}
%\vspace{-0.2cm}
\begin{tabular}{|c||c||c|c|c|} \hline
%Top line of heading
\multicolumn{1}{|c||}{Processor} &
\multicolumn{1}{c||}{Type of} &
\multicolumn{3}{c|}{Normalized} \\

%Middle line
\multicolumn{1}{|c||}{Configuration} &
\multicolumn{1}{c||}{Benchmark} &
\multicolumn{3}{c|}{Speedup} \\ \cline{3-5}

&&&&\\
%Bottom line
\multicolumn{1}{|c||}{} &
\multicolumn{1}{c||}{} &
\multicolumn{1}{c|}{330K} &
\multicolumn{1}{c|}{340K} &
\multicolumn{1}{c|}{350K} \\ \hline

Heterogeneous & Single & & & \\ 
CMOS-HTFET & threaded &1.46 & 1.19 & 1.01\\ 
Multicore & /programmed & & &\\\hline
Heterogeneous & Multi & & & \\ 
CMOS-HTFET & threaded &1.26 & 1.12 & 0.84\\ 
Multicore & /programmed & & &\\\hline
\end{tabular}
\vspace{0.1in}
\caption {Speedup relative to a homogeneous CMOS (or HTFET) baseline for single-threaded and multithreaded workloads at different temperature limits}
\label{tab:uarchresults-speedup}
%\vspace{-0.2in}
\end{center}
\end{table}

\begin{table}[ht!]
\begin{center}
%\vspace{-0.2cm}
\begin{tabular}{|c||c||c|c|c|} \hline
%Top line of heading
\multicolumn{1}{|c||}{Processor} &
\multicolumn{1}{c||}{Type of} &
\multicolumn{3}{c|}{Normalized} \\

%Middle line
\multicolumn{1}{|c||}{Configuration} &
\multicolumn{1}{c||}{Benchmark} &
\multicolumn{3}{c|}{EDP} \\ \cline{3-5}
&&&&\\
%Bottom line
\multicolumn{1}{|c||}{} &
\multicolumn{1}{c||}{} &
\multicolumn{1}{c|}{330K} &
\multicolumn{1}{c|}{340K} &
\multicolumn{1}{c|}{350K} \\ \hline

Heterogeneous & Single & & & \\ 
CMOS-HTFET & threaded &0.41 &0.65  & 0.80\\ 
Multicore & /programmed & & &\\\hline
Heterogeneous & Multi & & & \\ 
CMOS-HTFET & threaded & 0.45 & 0.68 & 1.11\\ 
Multicore & /programmed & & &\\\hline
\end{tabular}
\vspace{0.1in}
\caption {Normalized Energy-Delay Product relative to a homogeneous CMOS (or HTFET) baseline for single-threaded and multithreaded workloads at different temperature limits}
\label{tab:uarchresults-edp}
%\vspace{-0.2in}
\end{center}
\end{table}

Although HTFET cores for general-purpose processing may have an
advantage over CMOS only in domains with tighter thermal constraints,
they can still play a valuable role in augmenting high-end devices by
providing very efficient specialized coprocessors.  Over the last few
processor generations, customizing architectures with logic optimized
for domain-specific applications, such as graphics, multimedia, or
cryptography kernels, has gained importance alongside traditional
process-shrinking based improvements in processor performance. In the
following section, we examine the viability of using HTFET-based
accelerators as an energy efficient alternative to conventional
technology without compromising performance.

% LocalWords:  CMOS HTFET multicore microarchitecture microarchitectures HTFETs
% LocalWords:  tradeoff
